List of AI News about Sparse Attention
| Time | Details |
|---|---|
| 2026-04-26 08:07 | **Sparse Attention Breakthrough Slashes 128K Context Costs by 60%: Techniques to Scale LLM Context Windows [2026 Analysis]** According to @_avichawla on X, moving to sparse attention at 128K tokens cuts prefilling cost from about $0.65 to $0.35 per million tokens and decoding from about $2.40 to $0.80 per million tokens, with equal or better long-context performance on DeepSeek V3.2. As reported in the post, sparse attention can preserve quality when engineered carefully, opening room for larger context windows without prohibitive inference costs. According to research cited broadly in industry literature, additional techniques to extend context include rotary (RoPE) or YaRN position scaling to stabilize very long sequences, linear-attention variants such as Performer or Hyena to reduce quadratic complexity, retrieval-augmented generation to offload context to external memory, chunking with cross-attention bridges for hierarchical conditioning, sliding-window or recurrent state compression to maintain continuity, and test-time attention sinks or key-value cache eviction policies to cap memory growth. For businesses, these methods can lower serving costs and improve long-document QA, contract analysis, code comprehension, and multimodal-transcript workloads while maintaining accuracy at scale, according to common enterprise LLM deployment case studies. |
| 2026-04-26 08:07 | **DeepSeek V3.2 DSA Breakthrough: O(Lk) Sparse Attention Slashes 128K-Context Compute by Selecting Top‑k Tokens** According to @_avichawla on Twitter, DeepSeek's V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces attention complexity from O(L²) to O(Lk) by selecting only the top‑k key‑value pairs per query, capped at k = 2048 tokens even at a 128K context. As reported by @_avichawla, a lightweight Lightning Indexer ranks salient tokens using a small number of FP8 heads, enabling a compute‑cheap preselection step before running the expensive attention on the subset. According to the tweet, this design concentrates GPU FLOPs on useful tokens, offering lower latency and cost for long‑context inference and enabling scalable retrieval‑augmented generation and document intelligence workloads. As reported by the same source, the fixed k makes memory and compute predictable, which can translate into higher throughput per GPU and improved serving economics for enterprise long‑context applications. |
| 2026-04-26 08:06 | **Sparse Attention in Transformers: 3 Practical Patterns, Trade-offs, and 2026 Efficiency Trends – Analysis** According to @_avichawla on Twitter, sparse attention restricts attention to a subset of tokens via local windows and learned selection, reducing quadratic compute with a performance trade-off. As reported in Avi Chawla's post, practitioners combine local sliding windows, block-sparse patterns, and learned top-k routing to scale to longer contexts at lower cost. According to research commonly cited alongside sparse attention, such as Longformer and BigBird, these patterns cut memory and latency for multi-head attention while preserving accuracy on long-sequence tasks; this highlights business opportunities for cost-efficient inference, on-device LLMs, and long-context RAG pipelines. According to the tweet, teams must balance computational complexity against model quality when choosing window size, block patterns, and sparsity schedules, which directly impacts throughput, GPU memory planning, and serving costs. |
| 2025-09-29 10:10 | **DeepSeek-V3.2-Exp Launches with Sparse Attention for Faster AI Model Training and 50% API Price Drop** According to DeepSeek (@deepseek_ai), the company has launched DeepSeek-V3.2-Exp, an experimental AI model built on the V3.1-Terminus architecture. This release introduces DeepSeek Sparse Attention (DSA), a technology designed to enhance training and inference speed, particularly for long-context natural language processing tasks. The model is now accessible via app, web, and API platforms, with API pricing reduced by more than 50%. This development signals significant opportunities for businesses seeking affordable, high-performance AI solutions for long-form content analysis and enterprise applications (source: DeepSeek, Twitter). |
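The cost figures quoted in the first item work out to different savings for the two phases (simple arithmetic on the numbers attributed to the post; the per-million-token prices are taken from the tweet, not independently verified):

```python
# Prices in $/M tokens at 128K context, as quoted in the cited post.
prefill_old, prefill_new = 0.65, 0.35
decode_old, decode_new = 2.40, 0.80

prefill_saving = 1 - prefill_new / prefill_old   # fraction saved on prefill
decode_saving = 1 - decode_new / decode_old      # fraction saved on decode

print(f"prefill saving: {prefill_saving:.0%}")   # ~46%
print(f"decode saving:  {decode_saving:.0%}")    # ~67%
```

The roughly "60%" in the headline is thus closest to the decode-phase saving; prefill savings are smaller.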
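The top-k preselection idea behind DSA, as described above, can be sketched in a few lines: a cheap per-token score ranks keys, and full attention runs only over the k winners. This is a minimal NumPy illustration of the general technique, not DeepSeek's implementation; the scorer, shapes, and k value here are assumptions for the example:

```python
import numpy as np

def sparse_attention_topk(q, K, V, indexer_scores, k=4):
    """Attend only to the top-k keys ranked by a cheap indexer score.

    q: (d,) query; K, V: (L, d) keys/values;
    indexer_scores: (L,) per-token salience from a lightweight scorer.
    """
    L, d = K.shape
    k = min(k, L)
    # Cheap preselection: keep only the k most salient token positions.
    top = np.argsort(indexer_scores)[-k:]
    # Expensive attention over the subset only: O(k*d) instead of O(L*d).
    logits = K[top] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
L, d = 16, 8
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
q = rng.normal(size=d)
scores = rng.normal(size=L)  # stand-in for a Lightning-Indexer-style score
out = sparse_attention_topk(q, K, V, scores, k=4)
print(out.shape)  # (8,)
```

With k = L the result matches dense softmax attention, which is why a fixed k (2048 in DSA, per the tweet) gives predictable memory and compute regardless of context length.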
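The local sliding-window pattern mentioned in the third item is the simplest of the sparsity patterns: each query attends only to keys within a fixed distance, so the mask has O(L·w) nonzeros instead of O(L²). A minimal sketch of such a mask (window size and sequence length are arbitrary example values):

```python
import numpy as np

def sliding_window_mask(L, window):
    """Boolean mask: query i may attend key j iff |i - j| <= window."""
    idx = np.arange(L)
    return np.abs(idx[:, None] - idx[None, :]) <= window

m = sliding_window_mask(6, 1)
print(m.astype(int))
# Each row has at most 2*window + 1 ones, so attention cost grows
# linearly in L for fixed window size.
```

Longformer- and BigBird-style models combine such a local mask with a few global or random connections so distant tokens can still exchange information.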
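The "attention sinks plus KV-cache eviction" technique mentioned in the first item caps memory by keeping a few initial "sink" entries and the most recent entries while dropping the middle. A toy sketch of that eviction policy (function name, parameters, and the list-of-entries cache representation are assumptions for illustration):

```python
def evict_kv(cache, max_len, n_sinks=4):
    """Cap KV-cache growth: keep the first n_sinks 'attention sink'
    entries and the most recent (max_len - n_sinks) entries,
    dropping everything in between."""
    if len(cache) <= max_len:
        return cache
    return cache[:n_sinks] + cache[-(max_len - n_sinks):]

# Toy cache of 20 positions, capped at 8 entries:
cache = list(range(20))
print(evict_kv(cache, max_len=8))  # [0, 1, 2, 3, 16, 17, 18, 19]
```

In a real serving stack the entries would be per-layer key/value tensors, but the bound on cache length, and hence on attention cost per decoded token, works the same way.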
According to DeepSeek (@deepseek_ai), the company has launched DeepSeek-V3.2-Exp, an experimental AI model built on the V3.1-Terminus architecture. This release introduces DeepSeek Sparse Attention (DSA), a technology designed to enhance training and inference speed, particularly for long-context natural language processing tasks. The model is now accessible via app, web, and API platforms, with API pricing reduced by more than 50%. This development signals significant opportunities for businesses seeking affordable, high-performance AI solutions for long-form content analysis and enterprise applications (source: DeepSeek, Twitter). |